Gradient Coding With Dynamic Clustering for Straggler-Tolerant Distributed Learning

Authors

Abstract

Distributed implementations are crucial in speeding up large scale machine learning applications. Distributed gradient descent (GD) is widely employed to parallelize the learning task by distributing the dataset across multiple workers. A significant performance bottleneck for the per-iteration completion time in distributed synchronous GD is straggling workers. Coded computation techniques have been introduced recently to mitigate stragglers and to speed up GD iterations by assigning redundant computations to workers. In this paper, we introduce a novel paradigm of dynamic coded computation, which assigns redundant data to workers to acquire the flexibility to dynamically choose from among a set of possible codes depending on the past straggling behavior. In particular, we propose gradient coding (GC) with dynamic clustering, called GC-DC, and regulate the number of stragglers in each cluster by dynamically forming the clusters at each iteration. With time-correlated straggling behavior, GC-DC adapts to the straggling behavior over time; at each iteration, it aims to distribute the stragglers across clusters as uniformly as possible based on the past straggler behavior. For both homogeneous and heterogeneous worker models, we numerically show that GC-DC provides significant improvements in the average per-iteration completion time without an increase in the communication load compared to the original GC scheme.
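To make the dynamic clustering idea concrete, the following is a minimal Python sketch of the kind of assignment GC-DC performs at each iteration, based only on the abstract above: workers that straggled recently are spread across clusters as evenly as possible before the next GD iteration. The function and variable names are illustrative, not taken from the paper.

# Minimal sketch of dynamic cluster assignment (illustrative names, not the paper's).
def assign_clusters(straggler_history, num_clusters):
    """Place the most frequent recent stragglers first, in round-robin order,
    so that no cluster accumulates more stragglers than necessary."""
    # Sort workers so that those that straggled most often/recently come first.
    workers = sorted(straggler_history, key=straggler_history.get, reverse=True)
    clusters = [[] for _ in range(num_clusters)]
    for i, worker in enumerate(workers):
        clusters[i % num_clusters].append(worker)
    return clusters

# Example: 6 workers; straggler_history counts how often each straggled recently.
history = {"w0": 3, "w1": 0, "w2": 2, "w3": 0, "w4": 1, "w5": 0}
print(assign_clusters(history, num_clusters=2))
# The two frequent stragglers w0 and w2 end up in different clusters.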


Similar Articles

Near-Optimal Straggler Mitigation for Distributed Gradient Methods

Modern learning algorithms use gradient descent updates to train inferential models that best explain data. Scaling these approaches to massive data sizes requires proper distributed gradient descent schemes where distributed worker nodes compute partial gradients based on their partial and local data sets, and send the results to a master node where all the computations are aggregated into a f...
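As a point of reference for the synchronous distributed GD setup described in this snippet, here is a minimal single-process sketch in which each worker computes a partial gradient on its local shard and a master aggregates them into a full gradient step. The least-squares loss and the use of NumPy in place of MPI are assumptions made for brevity.

import numpy as np

def partial_gradient(X_shard, y_shard, w):
    # Least-squares gradient on one worker's local data shard.
    return X_shard.T @ (X_shard @ w - y_shard)

def master_step(shards, w, lr=0.01):
    # Master sums the partial gradients, normalizes, and takes one GD step.
    n = sum(len(y) for _, y in shards)
    grad = sum(partial_gradient(X, y, w) for X, y in shards) / n
    return w - lr * grad

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
shards = [(X[i::4], y[i::4]) for i in range(4)]  # 4 workers, disjoint shards
w = np.zeros(5)
for _ in range(50):
    w = master_step(shards, w)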


Gradient Coding: Avoiding Stragglers in Distributed Learning

We propose a novel coding theoretic framework for mitigating stragglers in distributed learning. We show how carefully replicating data blocks and coding across gradients can provide tolerance to failures and stragglers for synchronous Gradient Descent. We implement our schemes in python (using MPI) to run on Amazon EC2, and show how we compare against baseline approaches in running time and ge...
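The replication-and-coding idea described here can be checked numerically with the classic three-worker, one-straggler example from the gradient coding literature: each worker sends one linear combination of two partial gradients, and the master recovers the full gradient sum from any two responses. The specific coefficients below follow that textbook example and are not necessarily the schemes implemented in this paper.

import numpy as np

# Partial gradients on three data partitions (illustrative values).
g1, g2, g3 = np.array([1.0, 2.0]), np.array([3.0, -1.0]), np.array([0.5, 4.0])
full = g1 + g2 + g3

# Coded messages sent by the three workers (each holds two partitions).
m1 = 0.5 * g1 + g2        # worker 1 holds partitions {1, 2}
m2 = g2 - g3              # worker 2 holds partitions {2, 3}
m3 = 0.5 * g1 + g3        # worker 3 holds partitions {1, 3}

# The master decodes with a fixed combination for whichever pair arrives first.
assert np.allclose(2 * m1 - m2, full)   # worker 3 straggles
assert np.allclose(m1 + m3, full)       # worker 2 straggles
assert np.allclose(m2 + 2 * m3, full)   # worker 1 straggles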


Redundancy Techniques for Straggler Mitigation in Distributed Optimization and Learning

Performance of distributed optimization and learning systems is bottlenecked by “straggler” nodes and slow communication links, which significantly delay computation. We propose a distributed optimization framework where the dataset is “encoded” to have an over-complete representation with built-in redundancy, and the straggling nodes in the system are dynamically left out of the computation at...
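A rough sketch of the encoding idea described here, under simplifying assumptions of my own (a Gaussian encoding matrix and a one-shot least-squares solve rather than an iterative method): the data is encoded with built-in redundancy, a straggling block is dropped, and the problem is still solved approximately from the surviving encoded rows.

import numpy as np

rng = np.random.default_rng(1)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)

# Encode with 2x redundancy and split the encoded rows among 4 workers.
S = rng.normal(size=(400, 200)) / np.sqrt(400)
Xe, ye = S @ X, S @ y
blocks = np.array_split(np.arange(400), 4)

# Suppose worker 3 straggles: solve using only the surviving encoded blocks.
alive = np.concatenate(blocks[:3])
w_enc = np.linalg.lstsq(Xe[alive], ye[alive], rcond=None)[0]
w_full = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(w_enc - w_full))   # close to the full-data solution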


Entropy-based Consensus for Distributed Data Clustering

The increasingly larger scale of available data and the more restrictive concerns on their privacy are some of the challenging aspects of data mining today. In this paper, Entropy-based Consensus on Cluster Centers (EC3) is introduced for clustering in distributed systems with a consideration for confidentiality of data; i.e. it is the negotiations among local cluster centers that are used in t...


Source-Optimized Clustering for Distributed Source Coding

Motivated by the design of low-complexity distributed quantizers and iterative decoding algorithms that leverage the correlation in the data picked up by a large-scale sensor network, we address the problem of finding correlation preserving clusters. To construct a factor graph describing the statistical dependencies between sensor measurements, we develop a hierarchical clustering algorithm th...



Journal

Journal title: IEEE Transactions on Communications

Year: 2023

ISSN: 1558-0857, 0090-6778

DOI: https://doi.org/10.1109/tcomm.2022.3166902